@brightsparc brightsparc commented Jan 1, 2026

This PR adds support for GenAI/LLM observability workloads by extending the existing ClickHouse traces exporter with materialized fields optimized for LLM telemetry queries.

Changes

ClickHouse DDL Tool (clickhouse-ddl)

Added two new flags for traces table creation:

  • --materialize-genai-fields: Creates MATERIALIZED columns that extract GenAI semantic convention attributes from SpanAttributes, enabling efficient querying without parsing JSON on every query.

  • --partition-by-service-name: Optimizes multi-tenant deployments by partitioning data by (ServiceName, toDate(Timestamp)) and ordering by (ServiceName, Timestamp).

Materialized GenAI Fields

When --materialize-genai-fields is enabled, the following columns are automatically extracted:

| Column | Description |
| --- | --- |
| GenAIOperationName | Operation type (e.g., "chat", "embeddings") |
| GenAIProviderName | LLM provider (e.g., "openai", "anthropic") |
| GenAIRequestModel | Requested model name |
| GenAIResponseModel | Actual model used in the response |
| GenAIInputTokens | Input token count |
| GenAIOutputTokens | Output token count |
| GenAISystemFingerprint | System fingerprint for reproducibility |
| GenAILastInputMessage | Last input message (for quick access) |
| GenAILastOutputMessage | Last output message (for quick access) |
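For reference, the extraction these columns perform can be sketched in Python. This is an illustration only: the attribute keys below follow the OTel GenAI semantic conventions, and the exact keys the DDL tool reads (and its defaults for missing values) are assumptions here, not taken from the generated DDL.

```python
# Illustrative sketch (not the actual DDL output): the MATERIALIZED columns
# conceptually perform this lookup at insert time. Attribute keys are taken
# from the OTel GenAI semantic conventions and may differ from the tool's.
GENAI_COLUMN_MAP = {
    "GenAIOperationName": "gen_ai.operation.name",
    "GenAIProviderName": "gen_ai.provider.name",
    "GenAIRequestModel": "gen_ai.request.model",
    "GenAIResponseModel": "gen_ai.response.model",
    "GenAIInputTokens": "gen_ai.usage.input_tokens",
    "GenAIOutputTokens": "gen_ai.usage.output_tokens",
}

def materialize(span_attributes: dict) -> dict:
    """Mimic ClickHouse's SpanAttributes['key'] lookup: missing keys yield ''."""
    return {col: span_attributes.get(key, "") for col, key in GENAI_COLUMN_MAP.items()}
```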

Python Processors

Added genai_redaction_processor.py for redacting PII in GenAI message attributes:

  • gen_ai.input.messages
  • gen_ai.output.messages

Features:

  • Pattern-based redaction (emails, SSNs, phone numbers, credit cards, IPs)
  • Tool field redaction (passwords, API keys, secrets)
  • Optional SHA256 hashing instead of [REDACTED]
  • Debug/info/silent summary modes
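As a rough sketch of the pattern-based approach (not the actual processor — the real genai_redaction_processor supports more patterns, tool-field redaction, and summary modes), the redact-or-hash behavior looks something like:

```python
import hashlib
import re

# Minimal illustrative patterns; the real processor covers more PII types.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str, hash_values: bool = False) -> str:
    """Replace each PII match with [REDACTED], or a truncated SHA256 digest."""
    def replace(match: re.Match) -> str:
        if hash_values:
            return hashlib.sha256(match.group().encode()).hexdigest()[:16]
        return "[REDACTED]"
    for pattern in PATTERNS.values():
        text = pattern.sub(replace, text)
    return text
```

Hashing instead of masking preserves the ability to group or join on redacted values without exposing them.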

Usage

Creating Tables

```shell
# Create traces table with GenAI materialized fields and multi-tenant partitioning
clickhouse-ddl create \
  --endpoint http://localhost:8123 \
  --database otel \
  --traces \
  --materialize-genai-fields \
  --partition-by-service-name
```

Running with PII Redaction

```shell
PYTHONPATH=/path/to/rotel_python_processor_sdk \
rotel start \
  --otlp-grpc-endpoint 0.0.0.0:4317 \
  --otlp-with-trace-processor /path/to/your_redaction_processor.py \
  --exporter clickhouse \
  --clickhouse-exporter-endpoint http://localhost:8123 \
  --clickhouse-exporter-database otel
```

Querying GenAI Data

```sql
-- Get recent chat completions with token usage
SELECT
    ServiceName,
    GenAIProviderName,
    GenAIRequestModel,
    GenAIInputTokens,
    GenAIOutputTokens,
    GenAILastInputMessage,
    GenAILastOutputMessage
FROM otel.otel_traces
WHERE GenAIOperationName = 'chat'
ORDER BY Timestamp DESC
LIMIT 10;

-- Token usage by provider and model
SELECT
    GenAIProviderName,
    GenAIRequestModel,
    sum(GenAIInputTokens) AS total_input_tokens,
    sum(GenAIOutputTokens) AS total_output_tokens,
    count() AS request_count
FROM otel.otel_traces
WHERE GenAIOperationName = 'chat'
GROUP BY GenAIProviderName, GenAIRequestModel
ORDER BY total_output_tokens DESC;
```

Files Changed

  • .gitignore - Added Python cache patterns
  • src/bin/clickhouse-ddl/ddl_traces.rs - Added GenAI materialized columns generation
  • src/bin/clickhouse-ddl/main.rs - Added CLI flags
  • src/bin/clickhouse-ddl/README.md - Updated documentation
  • src/init/args.rs - Minor updates
  • src/exporters/clickhouse/mod.rs - Minor updates
  • rotel_python_processor_sdk/processors/genai_redaction_processor.py - New PII redaction processor
  • rotel_python_processor_sdk/python_tests/genai_redaction_processor_test.py - Tests for redaction processor

Contributor

@rjenkins rjenkins left a comment


Thanks @brightsparc, did a quick skim and this looks pretty good! I'm going to dig into the Python code tomorrow and should be able to finish review.

Question on the DDL: do you know if there have been any changes on the OTel collector side regarding the schema for the CH exporter with the GenAI changes? I checked here: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/clickhouseexporter/internal/sqltemplates, but it doesn't look like anyone has done any work on this there yet either. Either way, excited to get this in.

Contributor

mheffner commented Jan 5, 2026

Can you merge from main to include this PR: #256? That should allow us to schedule the status checks.

@brightsparc (Author)

> Question on the ddl, do you know if there has been any changes on the otel collector side in regards to schema for CH exporter with genai changes? I checked here https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/clickhouseexporter/internal/sqltemplates, but doesn't look like anyone has done any work on this yet there either. Either way excited to get this in.

I'm not aware of any specific new fields at the table level; most of these changes are under the span attributes, AFAIK.

Contributor

@rjenkins rjenkins left a comment

@brightsparc, I left a comment on the processor regarding overlap with the existing redaction processor; I'd like to hear your thoughts on that. Additionally, I'm curious how you are using/querying the data with these changes today. Are you using ClickStack, or are you querying ClickHouse directly in some other manner?

The other big change here is the schema changes to CH. Nice job on making them "opt-in" via args on clickhouse-ddl. I'm reaching out to our friends at ClickHouse to get their thoughts on these, so give me a day or two to get back to you.

From a high level, there are essentially two things we should consider.

  1. What is the performance impact on insert of the materialized columns and partition/order-by changes? As mentioned, it's opt-in, so not necessarily a blocker, but there is likely additional load on the CH server and theoretically a drop in insert performance. Ideally we'd quantify that and make a note in the README.md about the perf impact. On the flip side, query performance will be better for aligned queries.

We can likely help with some of the perf analysis if you can share some code to generate synthetic spans with the related GenAI fields; or, perhaps even more useful, we could modify our load generator https://github.com/streamfold/otel-loadgen to include these new fields and run some benchmarks.

  2. Does this have an impact on ClickStack? I suspect ClickStack will ignore the additional columns, so it might not be an issue, but we should test and verify we don't break anything in their UI.

If the ClickHouse folks find these additional columns very interesting, we may want to consider more closely how we generate them (materialized, or transformed natively in Rotel's CH exporter) and whether we should add them to the default OTel traces table or whether they're better suited to a separate GenAI traces table.

@@ -0,0 +1,463 @@
"""
@rjenkins (Contributor)

@brightsparc - Thanks for this processor. It seems like the bulk of the improvements here are some DX around handling redaction as it relates to items with the tool_call prefix within "gen_ai.input.messages" and "gen_ai.output.messages"? This seems useful to me, and we definitely want to add more out-of-the-box processors, so we might land this as-is. However, I'm curious if you looked at the generic redaction processor here: https://github.com/streamfold/rotel/blob/main/rotel_python_processor_sdk/processors/redaction_processor.py.

Seems like there is a fair amount of overlap. The redaction_processor and its configuration arguments are based loosely on the similar Go processor. Would some of these changes slot into the existing redaction processor as additional configuration options? WDYT about the overlap, and what was missing from the redaction processor that makes this genai_redaction_processor easier to use?

@brightsparc (Author)

Yes, I wasn't sure if there was a way to handle the same sort of GenAI-specific redaction with the existing out-of-the-box solution. I implemented this as more of a POC to see if I could include it as part of the broader GenAI materialization of properties. Also, I could see value in having an option to use Presidio for more sophisticated scrubbing.

I'm happy to drop this from the PR if it's not adding value.

@rjenkins (Contributor)

Presidio looks neat. We don't need to drop this, and agreed, it's probably a good idea to keep it separate for now (like a POC, as you say) so we can iterate on it without worrying about impact or accidental breakage to the generic redaction processor. I'm just wondering if longer term the capabilities will overlap and we can consolidate. That said, we want to grow the processor library, so overlap is not a major concern at the moment; if we later find users are confused about which processor to use, we can consolidate.

```rust
        "ORDER_BY",
        &get_order_by("(ServiceName, SpanName, toDateTime(Timestamp))", engine),
    ),
    ("PARTITION_BY", &get_partition_by(partition_expr, engine)),
```
@rjenkins (Contributor)

Note: We need to review performance impact on insert

@brightsparc (Author)

The goal here was to support multi-tenant querying of OTel records. I was going to do this by service.name, but I'm open to other/better ways of adding partitioning/indexing.

```rust
// Uses SpanAttributes['key'] syntax and casts for numeric values
const GENAI_MATERIALIZED_MAP_SQL: &str = r#"
-- GenAI MATERIALIZED columns (extracted from SpanAttributes Map)
GenAIConversationId String MATERIALIZED SpanAttributes['gen_ai.conversation.id'],
```
@rjenkins (Contributor)

Need to review performance impact on insert.


brightsparc commented Jan 6, 2026

> We can likely help with some of the perf analysis if you can share some code to generate synthetic spans with the related genai fields or perhaps even more useful we could modify our load generator https://github.com/streamfold/otel-loadgen to include these new fields and we can run some benchmarks.

I wasn't aware of this project. I think it could be good to simulate a combination of multi-turn and tool-calling scenarios, etc. The idea of adding the indexes was to be able to quickly get to the last message in a multi-turn conversation instead of having to query the whole large JSON object.

If we find that these additional columns are very interesting from the ClickHouse folks we may want to consider more closely how we generate these (materialized or transformed natively in Rotel's CH exporter) and whether we should add these to the default OTel traces table or if their better suited in a seperated GenAI traces table.

I initially started this PR by looking to transform the objects with a GenAI-specific exporter, but decided to go this route instead, which puts the heavy lifting back onto ClickHouse. That felt like a more optimal approach, but it would be interesting to benchmark.


rjenkins commented Jan 6, 2026

>> We can likely help with some of the perf analysis if you can share some code to generate synthetic spans with the related genai fields or perhaps even more useful we could modify our load generator https://github.com/streamfold/otel-loadgen to include these new fields and we can run some benchmarks.

> I wasn't aware of this project, I think it could be good to simulate a combination of various multiple turn, tool calling scenarios etc, the idea of adding the indexes was to be able quickly get to the last message in a multi-turn conversation instead of having to query the whole large json object.

>> If we find that these additional columns are very interesting from the ClickHouse folks we may want to consider more closely how we generate these (materialized or transformed natively in Rotel's CH exporter) and whether we should add these to the default OTel traces table or if their better suited in a seperated GenAI traces table.

> I initially started this PR by looking to transform the objects with a genai specifc exporter, but decided to go this route instead which put the heavy lifting back onto clickhouse which felt like a more optimal approach, but would be interesting to benchmark.

Thanks for the context. I will follow up with more thoughts on the schema changes in the coming days. RE: otel-loadgen, if you want to open a PR to simulate a combination of multi-turn and tool-calling scenarios, that would be 👍


rjenkins commented Jan 6, 2026

Hey @brightsparc, we're +1 on landing this but would like to get some benchmarks on our side first. We have a harness to run the benches, but would you be willing to open a PR for https://github.com/streamfold/otel-loadgen that generates synthetic data for the materialized columns? We can then run benchmarks on our end to see what the performance looks like, update the README.md with any relevant data, and get this merged.

Also, it might be nice to do a blog post about this. We should be able to configure ClickStack to view the new materialized columns, and we could discuss some use cases if you're interested in collaborating on a post.

@brightsparc (Author)

> Hey @brightsparc we're +1 on landing this but would like to get some benchmarks on our side first. We have a harness to run the benches, but would you be willing to open a PR for https://github.com/streamfold/otel-loadgen that generates synthetic data for materialized columns. We can then run benchmarks on our end to see what the performance looks like, update the README.md with any relevant data and get this merged.

It might be worth using a public open dataset that has a number of multi-turn inputs/outputs. A good example is the Salesforce APIGen dataset, which has 5k examples for multi-turn.
https://huggingface.co/datasets/Salesforce/APIGen-MT-5k

They also have an older dataset of 60k examples that are typically single turn, although with multiple tool calls.
https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k

You could use this as a source to create a decent column of data that has input and output messages.
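To make the suggestion concrete, one hypothetical way to map a multi-turn record from such a dataset onto the GenAI message attributes. Note the "conversations" and "role" field names are illustrative assumptions, not the actual APIGen schema:

```python
import json

def to_span_attributes(record: dict) -> dict:
    """Map a hypothetical multi-turn dataset record onto GenAI span attributes.
    The 'conversations'/'role' field names are assumptions for illustration."""
    messages = record.get("conversations", [])
    inputs = [m for m in messages if m.get("role") != "assistant"]
    outputs = [m for m in messages if m.get("role") == "assistant"]
    return {
        "gen_ai.input.messages": json.dumps(inputs),
        "gen_ai.output.messages": json.dumps(outputs),
    }
```

A loadgen change could use a transform along these lines to attach realistic message payloads to synthetic spans.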
